Advanced Regression Assignment A

Zach Wolpe
WLPZAC001

01 June 2020


Raw Data

We wish to understand how the body fat percentage of women from \(3\) villages in West Africa changes with age. The women's body fat appears to have a nonlinear, heteroscedastic relationship with age. First, let's visualize the data; we will focus on modeling the log triceps data.


Cubic B-Spline Basis

We construct a B-Spline basis over the range of the predictor variable (age), with \(20\) evenly spaced knots & degree \(3\) (cubic).

Model the data with the unpenalized B-Spline basis, then visualize the results.
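The fit itself was done in R (the summaries below come from `lm`); as an illustrative sketch only, the same cubic B-spline basis can be built from scratch with the Cox-de Boor recursion and fit by ordinary least squares in Python/NumPy. The data here are simulated and all names are hypothetical:

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate a B-spline basis at points x via the Cox-de Boor recursion.
    `knots` holds the boundary + evenly spaced interior knots; boundary
    knots are repeated `degree` extra times, as usual."""
    t = np.concatenate([np.repeat(knots[0], degree), knots,
                        np.repeat(knots[-1], degree)])
    # degree-0 (piecewise constant) basis: indicator of each knot interval
    B = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):
        if t[j] < t[j + 1]:
            # include the right endpoint in the last non-empty interval
            cond = (t[j] <= x) & ((x < t[j + 1]) |
                                  ((x == knots[-1]) & (t[j + 1] == knots[-1])))
            B[:, j] = np.where(cond, 1.0, 0.0)
    for d in range(1, degree + 1):  # raise the degree one step at a time
        B_next = np.zeros((len(x), B.shape[1] - 1))
        for j in range(B.shape[1] - 1):
            denom_l = t[j + d] - t[j]
            denom_r = t[j + d + 1] - t[j + 1]
            left = (x - t[j]) / denom_l * B[:, j] if denom_l > 0 else 0.0
            right = (t[j + d + 1] - x) / denom_r * B[:, j + 1] if denom_r > 0 else 0.0
            B_next[:, j] = left + right
        B = B_next
    return B  # shape: (len(x), len(knots) + degree - 1)

# Unpenalized fit, analogous to lm(y ~ basis): ordinary least squares
age = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * age) + 0.1 * np.random.default_rng(1).normal(size=200)
X = bspline_basis(age, np.linspace(0.0, 1.0, 20))
beta = np.linalg.lstsq(np.column_stack([np.ones(len(age)), X]), y, rcond=None)[0]
```

Note that the basis columns sum to one at every point, so adding an intercept makes the design rank deficient; this is the same singularity R warns about in the summaries below.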

## 
## Call:
## lm(formula = y ~ basis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.925  -1.674  -0.271   1.181  22.405 
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.282      1.874  10.822  < 2e-16 ***
## basis1       -12.542      2.489  -5.040 5.68e-07 ***
## basis2       -12.604      2.241  -5.625 2.50e-08 ***
## basis3       -11.211      2.184  -5.132 3.53e-07 ***
## basis4       -13.456      2.078  -6.476 1.57e-10 ***
## basis5       -14.701      2.085  -7.052 3.59e-12 ***
## basis6       -13.637      2.071  -6.585 7.87e-11 ***
## basis7       -12.645      2.095  -6.036 2.34e-09 ***
## basis8       -10.894      2.167  -5.028 6.02e-07 ***
## basis9        -5.818      2.291  -2.540 0.011260 *  
## basis10      -10.446      2.419  -4.319 1.75e-05 ***
## basis11       -4.131      2.387  -1.731 0.083875 .  
## basis12       -9.185      2.306  -3.983 7.36e-05 ***
## basis13       -4.038      2.537  -1.591 0.111885    
## basis14       -6.821      2.511  -2.717 0.006719 ** 
## basis15       -2.749      2.434  -1.130 0.258962    
## basis16       -8.871      2.497  -3.552 0.000403 ***
## basis17       -3.504      2.541  -1.379 0.168201    
## basis18       -4.919      2.953  -1.666 0.096137 .  
## basis19      -12.826      3.404  -3.768 0.000175 ***
## basis20        9.728      3.822   2.545 0.011084 *  
## basis21      -23.468      4.481  -5.237 2.05e-07 ***
## basis22           NA         NA      NA       NA    
## basis23           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.673 on 870 degrees of freedom
## Multiple R-squared:  0.4555, Adjusted R-squared:  0.4423 
## F-statistic: 34.65 on 21 and 870 DF,  p-value: < 2.2e-16
## Warning in predict.lm(model, basis, interval = "confidence"): prediction from a
## rank-deficient fit may be misleading

## 
## Call:
## lm(formula = y ~ basis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14085 -0.17949  0.00128  0.17100  1.09477 
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9839     0.1568  19.036  < 2e-16 ***
## basis1       -0.9497     0.2081  -4.563 5.78e-06 ***
## basis2       -0.9598     0.1874  -5.122 3.73e-07 ***
## basis3       -0.7965     0.1827  -4.360 1.46e-05 ***
## basis4       -1.0632     0.1738  -6.118 1.43e-09 ***
## basis5       -1.2820     0.1743  -7.353 4.46e-13 ***
## basis6       -1.1133     0.1732  -6.427 2.14e-10 ***
## basis7       -1.0185     0.1752  -5.813 8.63e-09 ***
## basis8       -0.7537     0.1812  -4.159 3.51e-05 ***
## basis9       -0.3875     0.1916  -2.022  0.04343 *  
## basis10      -0.6459     0.2023  -3.193  0.00146 ** 
## basis11      -0.4034     0.1997  -2.020  0.04366 *  
## basis12      -0.5913     0.1929  -3.066  0.00223 ** 
## basis13      -0.2925     0.2122  -1.378  0.16844    
## basis14      -0.4128     0.2100  -1.966  0.04962 *  
## basis15      -0.2243     0.2035  -1.102  0.27084    
## basis16      -0.5788     0.2089  -2.771  0.00570 ** 
## basis17      -0.2563     0.2125  -1.206  0.22816    
## basis18      -0.2745     0.2470  -1.111  0.26674    
## basis19      -0.9317     0.2847  -3.273  0.00111 ** 
## basis20       0.8037     0.3196   2.515  0.01210 *  
## basis21      -1.9650     0.3748  -5.243 1.98e-07 ***
## basis22           NA         NA      NA       NA    
## basis23           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3072 on 870 degrees of freedom
## Multiple R-squared:  0.4972, Adjusted R-squared:  0.4851 
## F-statistic: 40.97 on 21 and 870 DF,  p-value: < 2.2e-16
## Warning in predict.lm(model, basis, interval = "confidence"): prediction from a
## rank-deficient fit may be misleading

The model appears to overfit the data, presenting a violent, jagged curve that bounces between datapoints.

It is also noteworthy that the variation grows as a function of \(x=\text{age}\): the data clearly become more widespread with age.

More generalizable results may be achieved by penalizing the parameter coefficients.


P-Splines

Add a squared difference penalty to the model to penalize large discrepancies in adjacent coefficients. The penalty:

\[ P =\sum_{i=1}^{k-1} (\beta_{i+1} - \beta_{i})^2 \]

Thus minimizing the penalized least squares:

\[PSS = ||y - X\beta||^2 + \lambda \beta' P\beta\]

Yielding the solution (for a given \(\lambda\)):

\[\beta = (X'X + \lambda P)^{-1} X'y\]

The PLS is computed for a given \(\lambda\) value; here we fit it with \(\lambda=10\), learning the function by minimizing the \(PSS\).
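The report does this in R; the same closed-form solution can be sketched in Python/NumPy (simulated data, purely for illustration — `pspline_fit` is a hypothetical name):

```python
import numpy as np

def pspline_fit(X, y, lam):
    """Penalized least squares: beta = (X'X + lam * P)^{-1} X'y, where
    P = D'D and D is the first-order difference matrix, so that
    beta' P beta = sum_i (beta_{i+1} - beta_i)^2."""
    k = X.shape[1]
    D = np.diff(np.eye(k), axis=0)   # (k-1, k) difference matrix
    P = D.T @ D                      # squared-difference penalty matrix
    return np.linalg.solve(X.T @ X + lam * P, X.T @ y)

# Illustration on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.0, 1.2, 0.9, 1.1, 1.0, 0.8]) + rng.normal(size=100)
beta_0 = pspline_fit(X, y, 0.0)    # lambda = 0: ordinary least squares
beta_10 = pspline_fit(X, y, 10.0)  # lambda = 10: adjacent betas pulled together
```

As \(\lambda\) grows, the penalty pulls adjacent coefficients toward a common value, which is exactly what produces the straight-line limit discussed below.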

This appears to be a more reliable fit to the data.

On the extreme cases:

  • \(\lambda=0\): the original unpenalized model is fitted
  • \(\lambda=\infty\): no deviation between consecutive parameters is tolerated, thus fitting a straight line.

Using a very strong penalty (high \(\lambda\)) forces the fitted curve to be smooth.


Optimize Lambda

The fit is heavily reliant on a suitable choice of \(\lambda\). To find the optimal model, \(\lambda\) should be selected to minimize Generalized Cross Validation (GCV), a computationally simple approximation to LOOCV.

GCV is computed as:

\[GCV(\hat{f}) = \frac{1}{N} \sum_{i=1}^N(\frac{y_i- \hat{f}(x_i)}{1 - trace(S)/N})^2\]

Where \(S\):

\[ \begin{equation} \begin{split} \hat{y} &= Sy \\ X\hat{\beta} &= Sy \\ X(X'X + \lambda P)^{-1}X'y &= Sy \\ X(X'X + \lambda P)^{-1}X' &= S \end{split} \end{equation} \]

Here I compute the model for various \(\lambda\) values to find the model that minimizes GCV (thus minimizes the average predictive error in the model).
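The grid search itself is R code in the report; a Python/NumPy sketch of the same idea, on simulated data with an illustrative grid:

```python
import numpy as np

def gcv_score(X, y, lam):
    """GCV for the P-spline fit: mean of ((y - S y) / (1 - trace(S)/n))^2,
    with smoother matrix S = X (X'X + lam * P)^{-1} X'."""
    n, k = X.shape
    D = np.diff(np.eye(k), axis=0)
    P = D.T @ D
    S = X @ np.linalg.solve(X.T @ X + lam * P, X.T)
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.trace(S) / n)) ** 2)

# Evaluate GCV on a grid of candidate lambdas and keep the minimizer
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X.sum(axis=1) + rng.normal(size=80)
grid = np.linspace(0.01, 100.0, 100)
scores = [gcv_score(X, y, lam) for lam in grid]
best_lam = grid[int(np.argmin(scores))]
```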

## [1] "minimum GCV achieved at  λ= 0.0202020202020202"


Model A

Normally, the minimum-GCV \(\lambda\) is taken without question; however, in this specific instance, after a tiny initial decrease, GCV only seems to increase with \(\lambda\). The suggested minimizer, \(\lambda=0.02020\), is extremely small; as such the fitted curve is almost identical to the unpenalized model, and on visual inspection it does not appear to be an optimal model. Here we fit the suggested \(\lambda=0.02020\) model; denote this Model A:

Again, on visual inspection, this model appears subpar. If we re-examine the GCV scores over various \(\lambda\) values, it is readily apparent that although GCV increases with \(\lambda\), it does so at such an insignificant rate that it may be plausible to consider all GCV scores within a range equivalent. If we examine GCV scores over a wider range:

\[ \lambda \in \{1, \dots, 100 \}\] we observe that GCV scores remain relatively flat until \(\lambda \approx 10\).

Models B & C

Let's consider Model C (\(\lambda=10\)) & compare the two models.


Note: \(\lambda\) is on a log scale, so the increase in GCV as a function of \(\lambda\) is actually \(\approx\) linear; nevertheless \(\lambda=10\) appears to be a reasonable cut-off point for the change in GCV.

Model Comparison

For good measure let's include a third model in between the other two. Thus we'll compare \(3\) models:

  • Model A: \(\lambda=0.020202\)
  • Model B: \(\lambda=4\)
  • Model C: \(\lambda=10\)

Let's compare the three models with a few metrics, starting by simply visualizing them.


Models \(B\) & \(C\) appear very similar. Model \(A\) - which is virtually unpenalized - appears to overfit the data.

Now let's compare the models via a numerical approach.

Metrics

\(3\) metrics are used to compare models:

  a. Examination of the normality assumption of the errors
  b. AIC
  c. BIC

Assumption of Normality

Residuals are plotted against theoretical normal quantiles (a QQ plot) to assess normality visually. If the residuals deviate substantially from a normal distribution around \(0\), this may indicate poor fit, as there are strong signals ignored by the model.
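A QQ check of this kind can be sketched with SciPy's `probplot`; the simulated residuals here merely stand in for the fitted model's residuals:

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for a fitted model's residuals
residuals = np.random.default_rng(3).normal(loc=0.0, scale=0.3, size=500)

# probplot returns the theoretical normal quantiles paired with the ordered
# residuals, plus a least-squares line (slope, intercept) and its correlation
# r; r close to 1 indicates the residuals are plausibly normal.
(theoretical_q, ordered_res), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```

Passing `plot=` a matplotlib axes draws the plot directly; curvature in the points, or a markedly low `r`, flags the departure from normality described above.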

AIC and BIC

AIC & BIC are relative metrics in which lower values are favourable, given \(k = \text{effective no. parameters}\) and \(n\) the sample size.

Effective no. Parameters

Since we are dealing with GAMs we need to use the effective number of parameters (a function of the penalization) for \(k\). We define this as:

\[Trace(S)\]

Where:

\[S = X(X'X + \lambda P)^{-1}X'\]

\[AIC = -2 \log(L) + 2k\]

\[ BIC = -2 \log(L) + k \log(n)\]

In order to compute AIC & BIC

  • Compute variance of error term \(\sigma^2\)
  • Compute log-likelihood of the model
  • Calculate AIC, BIC

The log-likelihood of a normal distribution is given by:

\[\log(L) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Below we compute the AIC, BIC, effective number of parameters & the residuals normal qq plot:
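The report computes these in R; an equivalent NumPy sketch of the three steps (error variance, log-likelihood, AIC/BIC), using the trace of the smoother matrix as the effective degrees of freedom (`aic_bic` is a hypothetical helper name):

```python
import numpy as np

def aic_bic(X, y, lam):
    """Return (edf, AIC, BIC) for the P-spline fit at penalty lam."""
    n, k = X.shape
    D = np.diff(np.eye(k), axis=0)
    P = D.T @ D
    S = X @ np.linalg.solve(X.T @ X + lam * P, X.T)  # smoother matrix
    edf = np.trace(S)                                # effective no. parameters
    resid = y - S @ y
    sigma2 = np.sum(resid ** 2) / n                  # 1. error variance
    loglik = (-n / 2 * np.log(2 * np.pi * sigma2)
              - np.sum(resid ** 2) / (2 * sigma2))   # 2. normal log-likelihood
    aic = -2 * loglik + 2 * edf                      # 3. AIC
    bic = -2 * loglik + edf * np.log(n)              #    BIC
    return edf, aic, bic
```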

## [1] "Mødel A Effect Df: 21.6845096428966"
## [1] "Mødel B Effect Df: 12.8832651561525"
## [1] "Mødel C Effect Df: 10.331613714355"

##        model      AIC       BIC
## 1 λ=0.020202 20028.89 -330.8594
## 2       λ=10 20067.34 -292.4064
## 3        λ=4 20060.67 -299.0839

Model Conclusion

Residuals

All of the models exhibit similar residual diagnostics. In each model the residuals deviate from the assumption of normality to some degree; this is probably a consequence of the nonlinear nature of the data. The results are so similar that they offer no grounds on which to distinguish the models.

Effective number of Parameters

The effective number of parameters warrants consideration, as simpler models are preferred. It is simply a function of the severity of the penalization.

BIC & AIC

BIC & AIC are virtually indistinguishable across the three models & offer no basis for comparison.

Selection

The models appear to perform similarly across metrics. Whilst Model A boasts the smallest GCV (approximate LOOCV), it is only marginally smaller than the alternatives, & the alternatives offer simpler, smoother fits. Model A's lack of penalization also results in its fit becoming severely volatile near the ends of the range of the predictor variable (a consequence of small sample sizes near the extremes), making it unreliable for extremely young or old individuals.

For these reasons - & the principle of parsimonious models - Model B or C should be selected.

Models B & C are virtually identical, but Model C is simpler (more penalized) & should be selected.